## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## [1] 0
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
As can be seen above, the vast majority of wines are scored 5 or 6.
3’s and 4’s
## [1] 0.03939962
5’s and 6’s
## [1] 0.8248906
7’s and 8’s
## [1] 0.1357098
Scores of 5 and 6 account for over 82% of all scores! This suggests that the most useful information might be found by examining the lowest and highest scorers, but we’ll save that for later.
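The three proportions above can be reproduced from the quality table alone. The following is a minimal sketch that hard-codes the counts printed earlier rather than reading the data set; the names `quality_counts` and `group_share` are made up for illustration.

```r
# Counts copied from the quality table printed above (scores 3 through 8)
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Total number of wines; should match the 1,599 observations
total <- sum(quality_counts)

# Share of wines in the low (3-4), middle (5-6), and high (7-8) score groups
group_share <- c(
  low  = sum(quality_counts[c("3", "4")]),
  mid  = sum(quality_counts[c("5", "6")]),
  high = sum(quality_counts[c("7", "8")])
) / total

print(round(group_share, 4))
```

With the full data frame loaded, `prop.table(table(reds$quality))` should give the per-score shares directly.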
Examine correlations (technically these are bivariate plots, but the correlation coefficients are displayed nicely).
Alcohol, volatile.acidity, and sulphates present correlation coefficient values furthest from zero, so let’s examine these further.
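One way to rank features by how far their correlation with quality sits from zero is sketched below. This is not the code used to produce the plots; `cor_with_quality` is a hypothetical helper, and `toy` is simulated stand-in data (not the real wine values) so that the block runs on its own.

```r
# Simulated stand-in data, NOT the real wine data set
set.seed(42)
n <- 200
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  sulphates        = rnorm(n, mean = 0.66, sd = 0.17)
)
# Made-up relationship, used only to give the toy data some signal
toy$quality <- round(5.6 + 0.4 * (toy$alcohol - 10.4) -
                       1.2 * (toy$volatile.acidity - 0.53) +
                       rnorm(n, sd = 0.6))

# Pearson correlation of every feature with the target,
# ordered by absolute value (furthest from zero first)
cor_with_quality <- function(df, target = "quality") {
  features <- setdiff(names(df), target)
  r <- sapply(df[features], function(x) cor(x, df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

toy_cors <- cor_with_quality(toy)
print(round(toy_cors, 3))
```

On the real data the call would be `cor_with_quality(reds)`, assuming every column of `reds` is numeric.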
There are 1,599 observations of red wines with 12 recorded features for each observation. Some of the features are related to each other (e.g., those related to acidity). Quality is the only categorical feature.
Alcohol, volatile.acidity, and sulphates seem to be the features most highly correlated with quality scores.
It was unclear at this point which features would be useful.
No, there didn’t seem to be much of a need to create new variables.
None of the features seemed unusual enough to explore further and, no, I didn’t bother tidying/adjusting the form of the data at this point.
Continue to focus on alcohol, volatile.acidity, and sulphates.
library(dplyr)

# Mean of each candidate feature, plus the number of wines, per quality score
quality_groups <- group_by(reds, quality)
grouped_reds <- summarise(quality_groups,
                          alcohol_mean = mean(alcohol),
                          volatile_acid_mean = mean(volatile.acidity),
                          sulphates_mean = mean(sulphates),
                          n = n())
grouped_reds <- arrange(grouped_reds, quality)
Simply put: alcohol, volatile.acidity, and sulphates (particularly the first two) appear to have an effect on the quality scores. Alcohol will be discussed below, but, in general, the lower the volatile acidity, the higher the quality score; the inverse is true for sulphates and quality score.
I didn’t bother looking at the other features because I’m focusing on answering the primary question driving this project.
Alcohol. Funnily enough, a higher alcohol content seems to encourage a higher score.
As my examination continued, I felt better and better about the apparent relationship between alcohol, volatile.acidity, and sulphates and quality scores.
I found it interesting that sulphate levels seem to have a sweet spot when it comes to quality scores.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = reds)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = reds)
## m5: lm(formula = quality ~ alcohol:volatile.acidity:sulphates, data = reds)
## m6: lm(formula = quality ~ alcohol * volatile.acidity * sulphates,
## data = reds)
##
## ===================================================================================================
## m1 m2 m3 m4 m5 m6
## ---------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.611*** 5.763*** 1.285
## (0.175) (0.184) (0.196) (0.196) (0.058) (2.188)
## alcohol 0.361*** 0.314*** 0.309*** 0.309*** 0.426*
## (0.017) (0.016) (0.016) (0.016) (0.209)
## volatile.acidity -1.384*** -1.221*** -1.221*** 9.044*
## (0.095) (0.097) (0.097) (4.030)
## sulphates 0.679*** 0.679*** 2.713
## (0.101) (0.101) (3.226)
## alcohol x volatile.acidity x sulphates -0.036* 1.524*
## (0.015) (0.593)
## alcohol x volatile.acidity -0.996*
## (0.389)
## alcohol x sulphates -0.184
## (0.309)
## volatile.acidity x sulphates -15.622*
## (6.130)
## ---------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.336 0.003 0.351
## adj. R-squared 0.226 0.316 0.335 0.335 0.003 0.349
## sigma 0.710 0.668 0.659 0.659 0.806 0.652
## F 468.267 370.379 268.912 268.912 5.414 123.160
## p 0.000 0.000 0.000 0.000 0.020 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1599.384 -1923.929 -1580.453
## Deviance 805.870 711.796 692.105 692.105 1038.644 675.909
## AIC 3448.114 3251.628 3208.768 3208.768 3853.857 3178.905
## BIC 3464.245 3273.136 3235.654 3235.654 3869.988 3227.300
## N 1599 1599 1599 1599 1599 1599
## ===================================================================================================
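The AIC and BIC rows in the table above can be cross-checked from the log-likelihood row. The sketch below hard-codes the printed log-likelihoods and assumes the usual definitions for a linear model, where the parameter count `k` is the number of coefficients plus one for the residual variance; m4 is left out because its output is identical to m3.

```r
# Log-likelihoods copied from the table above (m4 omitted: same as m3)
loglik <- c(m1 = -1721.057, m2 = -1621.814, m3 = -1599.384,
            m5 = -1923.929, m6 = -1580.453)

# Parameters per model: coefficients (including the intercept) + 1 for sigma
k <- c(m1 = 3, m2 = 4, m3 = 5, m5 = 3, m6 = 9)

n <- 1599  # number of observations

# Standard definitions of the two information criteria
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

print(round(rbind(AIC = aic, BIC = bic), 3))
print(names(which.min(aic)))  # model with the lowest AIC
print(names(which.min(bic)))  # model with the lowest BIC
```

The recomputed values agree with the table up to rounding in the last digit, and both criteria favor the same model that R^2 does.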
Since model 6 had the highest R^2 value, I tested it with some obvious extreme cases (based on what seems to have been discovered above) using only values for alcohol, volatile.acidity, and sulphates:
## fit lwr upr
## 1 7.43747 6.1107 8.764241
## fit lwr upr
## 1 5.538351 4.259407 6.817296
## fit lwr upr
## 1 4.845372 3.271417 6.419327
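As should be somewhat expected from the entire investigation so far, combined with the not-too-shabby R^2 value of the linear regression model I selected, these predictions landed where expected: the fitted values are roughly 7.4, 5.5, and 4.8, although each interval (lwr to upr) spans more than two quality points.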
Strength of this model: it works for the obvious cases. Weakness of this model: it’s unclear how robust it is.
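To make the mechanics of a model 6 prediction concrete, the sketch below applies the rounded coefficients from the regression table by hand. The input values are hypothetical and are not the cases tested above; `predict_m6` is an illustrative helper that returns only a point estimate, without the lwr/upr interval that `predict()` on the fitted model would provide.

```r
# Coefficients of m6, copied (rounded to 3 decimals) from the regression table
m6_coef <- c(
  intercept        = 1.285,
  alcohol          = 0.426,
  volatile.acidity = 9.044,
  sulphates        = 2.713,
  alcohol_va       = -0.996,   # alcohol x volatile.acidity
  alcohol_sul      = -0.184,   # alcohol x sulphates
  va_sul           = -15.622,  # volatile.acidity x sulphates
  alcohol_va_sul   = 1.524     # three-way interaction
)

# Point prediction: each coefficient times its matching term, summed
predict_m6 <- function(alcohol, va, sul) {
  terms <- c(1, alcohol, va, sul,
             alcohol * va, alcohol * sul, va * sul,
             alcohol * va * sul)
  sum(m6_coef * terms)
}

# Hypothetical inputs chosen for illustration only
high_case <- predict_m6(alcohol = 14, va = 0.2, sul = 0.8)
low_case  <- predict_m6(alcohol = 9,  va = 1.0, sul = 0.4)
print(c(high = high_case, low = low_case))
```

Because the coefficients are rounded, these numbers will differ slightly from what the fitted model object would return for the same inputs.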
This plot makes it easy to see the distribution of quality scores (most are in the middle), the rightward trend of alcohol content, the downward slope of volatile.acidity, and the mid-range sweet-spot of sulphates levels all in relation to quality scoring.
Although alcohol and quality are swapped from their perhaps expected axis locations, the swapping, along with the smoothing line, makes it clear that as quality increases, so does alcohol content (and vice versa). Volatile.acidity and sulphates continue to play supporting roles.
Perhaps my favorite plot, borrowing from the experimentation with contours earlier, this plot, although leaving out sulphates, makes it clear that there are distinct clusters of quality scores that are quite obviously related to volatile.acidity and alcohol levels. If I were given a new red wine with only those two features listed, I would be very confident using merely this plot to predict the quality score (assuming the same wine experts responsible for this data set).
A fruitful exercise, this project exposed two or three features of red wines that, when related to one another, seem to lead to obvious groupings. Alcohol, volatile.acidity, and sulphates (in that order) appear to affect the (perceived) quality of red wines, at least among those wine experts consulted in the making of this data set.